Safe and Efficient Off-Policy Reinforcement Learning

Munos, Remi, Stepleton, Tom, Harutyunyan, Anna, Bellemare, Marc

Neural Information Processing Systems

In this work, we take a fresh look at some old and new algorithms for off-policy, return-based reinforcement learning. Expressing these in a common form, we derive a novel algorithm, Retrace(lambda), with three desired properties: (1) it has low variance; (2) it safely uses samples collected from any behaviour policy, whatever its degree of off-policyness; and (3) it is efficient as it makes the best use of samples collected from near on-policy behaviour policies. We analyse the contractive nature of the related operator under both off-policy policy evaluation and control settings and derive online sample-based algorithms. We believe this is the first return-based off-policy control algorithm converging a.s. to Q* without the GLIE assumption (Greedy in the Limit with Infinite Exploration). As a corollary, we prove the convergence of Watkins' Q(lambda), which was an open problem since 1989. We illustrate the benefits of Retrace(lambda) on a standard suite of Atari 2600 games.
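The Retrace(lambda) return described in the abstract can be sketched as a backward recursion over one trajectory, using the paper's truncated importance-sampling traces c_s = lambda * min(1, pi(a_s|x_s) / mu(a_s|x_s)). The function name and array layout below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def retrace_targets(q, rewards, exp_q_next, pi_probs, mu_probs,
                    gamma=0.99, lam=1.0):
    """Sketch of Retrace(lambda) targets for a single trajectory.

    q[s]          : current estimate Q(x_s, a_s)
    rewards[s]    : reward r_s
    exp_q_next[s] : E_{a ~ pi} Q(x_{s+1}, a)  (0 at terminal steps)
    pi_probs[s]   : pi(a_s | x_s);  mu_probs[s] : mu(a_s | x_s)
    """
    q, rewards, exp_q_next = map(np.asarray, (q, rewards, exp_q_next))
    # Truncated IS traces: c_s = lam * min(1, pi/mu) -- bounded, hence low variance.
    c = lam * np.minimum(1.0, np.asarray(pi_probs) / np.asarray(mu_probs))
    # TD errors evaluated under the target policy pi.
    delta = rewards + gamma * exp_q_next - q
    corrections = np.zeros_like(delta)
    acc = 0.0
    # Backward recursion: Delta_s = delta_s + gamma * c_{s+1} * Delta_{s+1}.
    for s in reversed(range(len(delta))):
        nxt = c[s + 1] if s + 1 < len(delta) else 0.0
        acc = delta[s] + gamma * nxt * acc
        corrections[s] = acc
    return q + corrections
```

With pi == mu and lam = 1 the traces are all 1 and the recursion reduces to the on-policy lambda-return, which is the "efficient near on-policy" case the abstract highlights.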


Reviews: Safe and Efficient Off-Policy Reinforcement Learning

Neural Information Processing Systems

In particular, it bounds the performance of off-policy importance sampling as a function of a truncation coefficient, and discusses how to choose that coefficient based on the bound it proposes. The lack of discussion of the relationship to that work makes paper 602 considerably weaker in my opinion. I would still lean towards acceptance, but only as a poster. Analyzing the convergence of the general-form off-policy updates in Equation 4 is novel and important. The theory is limited to finite state spaces and discounted MDPs (something that should be stated in the abstract), but the empirical results show that the new Retrace algorithm can perform well in conjunction with value function approximation.
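The truncation the reviewer refers to trades bias for variance: clipping the importance ratios pi/mu at a coefficient c bounds the weights but biases the estimate. A toy illustration, where the sampled probabilities and the choice c = 1 are assumptions for demonstration only:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy behaviour (mu) and target (pi) probabilities for 1000 sampled actions.
mu = rng.uniform(0.05, 1.0, size=1000)
pi = rng.uniform(0.0, 1.0, size=1000)

w = pi / mu                  # full importance weights: unbiased, high variance
c = 1.0                      # truncation coefficient (assumed value)
w_trunc = np.minimum(c, w)   # truncated weights: biased, but bounded by c
```

The truncated weights are capped at c, which removes the heavy tail responsible for most of the variance of plain importance sampling.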


Efficient Off-Policy Reinforcement Learning via Brain-Inspired Computing

Ni, Yang, Abraham, Danny, Issa, Mariam, Kim, Yeseong, Mercati, Pietro, Imani, Mohsen

arXiv.org Artificial Intelligence

Reinforcement Learning (RL) has opened up new opportunities to enhance existing smart systems that generally include a complex decision-making process. However, modern RL algorithms, e.g., Deep Q-Networks (DQN), are based on deep neural networks, resulting in high computational costs. In this paper, we propose QHD, an off-policy, value-based hyperdimensional reinforcement learning method that mimics brain properties to achieve robust and real-time learning. QHD relies on a lightweight brain-inspired model to learn an optimal policy in an unknown environment. On both desktop and power-limited embedded platforms, QHD achieves significantly better overall efficiency than DQN while providing higher or comparable rewards. QHD is also suitable for highly efficient reinforcement learning with great potential for online and real-time learning. Our solution supports a small experience replay batch size that provides a 12.3x speedup compared to DQN while ensuring minimal quality loss. Our evaluation shows QHD's capability for real-time learning, providing a 34.6x speedup and significantly better quality of learning than DQN.
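As a rough illustration of the hyperdimensional style of model the abstract alludes to (a generic sketch, not QHD's actual architecture, which the abstract does not specify): states are encoded as high-dimensional bipolar vectors via a fixed random projection, and Q-values are read out as similarity against one trainable hypervector per action. All dimensions and names here are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 2048                     # hypervector dimensionality (assumed)
STATE_DIM, N_ACTIONS = 4, 2  # toy problem sizes (assumed)

# Fixed random projection: encode a real-valued state as a bipolar hypervector.
proj = rng.standard_normal((D, STATE_DIM))

def encode(state):
    return np.sign(proj @ state)

# One trainable hypervector per action; Q(s, a) = similarity(encode(s), model[a]).
model = np.zeros((N_ACTIONS, D))

def q_values(state):
    return model @ encode(state) / D

def update(state, action, target, lr=0.1):
    # Move the chosen action's hypervector toward the encoded state,
    # scaled by the TD-style error (target - current Q).
    h = encode(state)
    err = target - q_values(state)[action]
    model[action] += lr * err * h
```

Such models are cheap to update (a single vector addition per step), which is the kind of property the abstract credits for QHD's speedups over DQN.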

